Goto

Collaborating Authors

 edge server


EVER: Edge-Assisted Auto-Verification for Mobile MR-Aided Operation

Chen, Jiangong, Zhu, Mingyu, Li, Bin

arXiv.org Artificial Intelligence

Mixed Reality (MR)-aided operation overlays digital objects on the physical world to provide a more immersive and intuitive operation process. A primary challenge is the precise and fast auto-verification of whether the user follows MR guidance by comparing frames before and after each operation. The pre-operation frame includes virtual guiding objects, while the post-operation frame contains physical counterparts. Existing approaches fall short of accounting for the discrepancies between physical and virtual objects due to imperfect 3D modeling or lighting estimation. In this paper, we propose EVER: an edge-assisted auto-verification system for mobile MR-aided operations. Unlike traditional frame-based similarity comparisons, EVER leverages the segmentation model and rendering pipeline adapted to the unique attributes of frames with physical pieces and those with their virtual counterparts; it adopts a threshold-based strategy using Intersection over Union (IoU) metrics for accurate auto-verification. To ensure fast auto-verification and low energy consumption, EVER offloads compute-intensive tasks to an edge server. Through comprehensive evaluations of public datasets and custom datasets with practical implementation, EVER achieves over 90% verification accuracy within 100 milliseconds (significantly faster than average human reaction time of approximately 273 milliseconds), while consuming only minimal additional computational resources and energy compared to a system without auto-verification.


Collaborative Large Language Model Inference via Resource-Aware Parallel Speculative Decoding

Koh, Jungyeon, Yang, Hyun Jong

arXiv.org Artificial Intelligence

The growing demand for on-device large language model (LLM) inference highlights the need for efficient mobile edge computing (MEC) solutions, especially in resource-constrained settings. Speculative decoding offers a promising solution by partitioning token generation between a lightweight draft model on mobile devices and a powerful target model on edge servers, but suffers from communication overhead and asynchronous delays. This paper is the first to propose a unified framework that jointly optimizes user association and resource allocation (UARA) to support efficient parallel speculative decoding. We solve the UARA problem using a multi-agent deep reinforcement learning algorithm. To evaluate our approach under realistic conditions, we conduct experiments using the Sionna simulator. Results show that our method achieves up to 28.0% and an average of 23.7% reduction in end-to-end latency without compromising inference accuracy, enabling scalable and low-latency LLM services in MEC systems.


An Efficient Privacy-preserving Intrusion Detection Scheme for UAV Swarm Networks

Gharami, Kanchon, Moni, Shafika Showkat

arXiv.org Artificial Intelligence

The rapid proliferation of unmanned aerial vehicles (UAVs) and their applications in diverse domains, such as surveillance, disaster management, agriculture, and defense, have revolutionized modern technology. While the potential benefits of swarm-based UAV networks are growing significantly, they are vulnerable to various security attacks that can jeopardize the overall mission success by degrading their performance, disrupting decision-making, and compromising the trajectory planning process. The Intrusion Detection System (IDS) plays a vital role in identifying potential security attacks to ensure the secure operation of UAV swarm networks. However, conventional IDS primarily focuses on binary classification with resource-intensive neural networks and faces challenges, including latency, privacy breaches, increased performance overhead, and model drift. This research aims to address these challenges by developing a novel lightweight and federated continuous learning-based IDS scheme. Our proposed model facilitates decentralized training across diverse UAV swarms to ensure data heterogeneity and privacy. The performance evaluation of our model demonstrates significant improvements, with classification accuracies of 99.45% on UKM-IDS, 99.99% on UAV-IDS, 96.85% on TLM-UAV dataset, and 98.05% on Cyber-Physical datasets.


SlimCaching: Edge Caching of Mixture-of-Experts for Distributed Inference

Chen, Qian, Chen, Xianhao, Huang, Kaibin

arXiv.org Artificial Intelligence

Mixture-of-Experts (MoE) models improve the scalability of large language models (LLMs) by activating only a small subset of relevant experts per input. However, the sheer number of expert networks in an MoE model introduces a significant storage burden for an edge device. To address this challenge, we consider a scenario where experts are dispersed across an edge network for distributed inference. Based on the popular Top-$K$ expert selection strategy, we formulate a latency minimization problem by optimizing expert caching on edge servers under storage constraints. When $K=1$, the problem reduces to a monotone submodular maximization problem with knapsack constraints, for which we design a greedy-based algorithm with a $(1 - 1/e)$-approximation guarantee. For the general case where $K \geq 1$, expert co-activation within the same MoE layer introduces non-submodularity, which renders greedy methods ineffective. To tackle this issue, we propose a successive greedy decomposition method to decompose the original problem into a series of subproblems, with each being solved by a dynamic programming approach. Furthermore, we design an accelerated algorithm based on the max-convolution technique to obtain the approximate solution with a provable guarantee in polynomial time. Simulation results on various MoE models demonstrate that our method significantly reduces inference latency compared to existing baselines.


A Smart-Glasses for Emergency Medical Services via Multimodal Multitask Learning

Jin, Liuyi, Gunawardena, Pasan, Haroon, Amran, Wang, Runzhi, Lee, Sangwoo, Stoleru, Radu, Middleton, Michael, Huo, Zepeng, Kim, Jeeeun, Moats, Jason

arXiv.org Artificial Intelligence

Emergency Medical Technicians (EMTs) operate in high-pressure environments, making rapid, life-critical decisions under heavy cognitive and operational loads. We present EMSGlass, a smart-glasses system powered by EMSNet, the first multimodal multitask model for Emergency Medical Services (EMS), and EMSServe, a low-latency multimodal serving framework tailored to EMS scenarios. EMSNet integrates text, vital signs, and scene images to construct a unified real-time understanding of EMS incidents. Trained on real-world multimodal EMS datasets, EMSNet simultaneously supports up to five critical EMS tasks with superior accuracy compared to state-of-the-art unimodal baselines. Built on top of PyTorch, EMSServe introduces a modality-aware model splitter and a feature caching mechanism, achieving adaptive and efficient inference across heterogeneous hardware while addressing the challenge of asynchronous modality arrival in the field. By optimizing multimodal inference execution in EMS scenarios, EMSServe achieves 1.9x -- 11.7x speedup over direct PyTorch multimodal inference. A user study evaluation with six professional EMTs demonstrates that EMSGlass enhances real-time situational awareness, decision-making speed, and operational efficiency through intuitive on-glass interaction. In addition, qualitative insights from the user study provide actionable directions for extending EMSGlass toward next-generation AI-enabled EMS systems, bridging multimodal intelligence with real-world emergency response workflows.


SLED: A Speculative LLM Decoding Framework for Efficient Edge Serving

Li, Xiangchen, Spatharakis, Dimitrios, Ghafouri, Saeid, Fan, Jiakun, Vandierendonck, Hans, John, Deepu, Ji, Bo, Nikolopoulos, Dimitrios

arXiv.org Artificial Intelligence

The growing gap between the increasing complexity of large language models (LLMs) and the limited computational budgets of edge devices poses a key challenge for efficient on-device inference, despite gradual improvements in hardware capabilities. Existing strategies, such as aggressive quantization, pruning, or remote inference, trade accuracy for efficiency or lead to substantial cost burdens. This position paper introduces a new framework that leverages speculative decoding, previously viewed primarily as a decoding acceleration technique for autoregressive generation of LLMs, as a promising approach specifically adapted for edge computing by orchestrating computation across heterogeneous devices. We propose \acronym, a framework that allows lightweight edge devices to draft multiple candidate tokens locally using diverse draft models, while a single, shared edge server verifies the tokens utilizing a more precise target model. To further increase the efficiency of verification, the edge server batch the diverse verification requests from devices. This approach supports device heterogeneity and reduces server-side memory footprint by sharing the same upstream target model across multiple devices. Our initial experiments with Jetson Orin Nano, Raspberry Pi 4B/5, and an edge server equipped with 4 Nvidia A100 GPUs indicate substantial benefits: 2.2 more system throughput, 2.8 more system capacity, and better cost efficiency, all without sacrificing model accuracy.


EPARA: Parallelizing Categorized AI Inference in Edge Clouds

Wang, Yubo, Cui, Yubo, Shi, Tuo, Li, Danyang, Li, Wenxin, Suo, Lide, Wang, Tao, Xie, Xin

arXiv.org Artificial Intelligence

With the increasing adoption of AI applications such as large language models and computer vision AI, the computational demands on AI inference systems are continuously rising, making the enhancement of task processing capacity using existing hardware a primary objective in edge clouds. We propose EPARA, an end-to-end AI parallel inference framework in edge, aimed at enhancing the edge AI serving capability. Our key idea is to categorize tasks based on their sensitivity to latency/frequency and requirement for GPU resources, thereby achieving both request-level and service-level task-resource allocation. EPARA consists of three core components: 1) a task-categorized parallelism allocator that decides the parallel mode of each task, 2) a distributed request handler that performs the calculation for the specific request, and 3) a state-aware scheduler that periodically updates service placement in edge clouds. We implement a EPARA prototype and conduct a case study on the EPARA operation for LLMs and segmentation tasks. Evaluation through testbed experiments involving edge servers, embedded devices, and microcomputers shows that EPARA achieves up to 2.1$\times$ higher goodput in production workloads compared to prior frameworks, while adapting to various edge AI inference tasks.


Bayes-Split-Edge: Bayesian Optimization for Constrained Collaborative Inference in Wireless Edge Systems

Safaeipour, Fatemeh Zahra, Chakareski, Jacob, Hashemi, Morteza

arXiv.org Artificial Intelligence

Mobile edge devices (e.g., AR/VR headsets) typically need to complete timely inference tasks while operating with limited on-board computing and energy resources. In this paper, we investigate the problem of collaborative inference in wireless edge networks, where energy-constrained edge devices aim to complete inference tasks within given deadlines. These tasks are carried out using neural networks, and the edge device seeks to optimize inference performance under energy and delay constraints. The inference process can be split between the edge device and an edge server, thereby achieving collaborative inference over wireless networks. We formulate an inference utility optimization problem subject to energy and delay constraints, and propose a novel solution called Bayes-Split-Edge, which leverages Bayesian optimization for collaborative split inference over wireless edge networks. Our solution jointly optimizes the transmission power and the neural network split point. The Bayes-Split-Edge framework incorporates a novel hybrid acquisition function that balances inference task utility, sample efficiency, and constraint violation penalties. We evaluate our approach using the VGG19 model on the ImageNet-Mini dataset, and Resnet101 on Tiny-ImageNet, and real-world mMobile wireless channel datasets. Numerical results demonstrate that Bayes-Split-Edge achieves up to 2.4x reduction in evaluation cost compared to standard Bayesian optimization and achieves near-linear convergence. It also outperforms several baselines, including CMA-ES, DIRECT, exhaustive search, and Proximal Policy Optimization (PPO), while matching exhaustive search performance under tight constraints. These results confirm that the proposed framework provides a sample-efficient solution requiring maximum 20 function evaluations and constraint-aware optimization for wireless split inference in edge computing systems.


Reinforcement Learning-Driven Edge Management for Reliable Multi-view 3D Reconstruction

Mounesan, Motahare, Saha, Sourya, Gan, Houchao, Absur, Md. Nurul, Debroy, Saptarshi

arXiv.org Artificial Intelligence

Real-time multi-view 3D reconstruction is a mission-critical application for key edge-native use cases, such as fire rescue, where timely and accurate 3D scene modeling enables situational awareness and informed decision-making. However, the dynamic and unpredictable nature of edge resource availability introduces disruptions, such as degraded image quality, unstable network links, and fluctuating server loads, which challenge the reliability of the reconstruction pipeline. In this work, we present a reinforcement learning (RL)-based edge resource management framework for reliable 3D reconstruction to ensure high quality reconstruction within a reasonable amount of time, despite the system operating under a resource-constrained and disruption-prone environment. In particular, the framework adopts two cooperative Q-learning agents, one for camera selection and one for server selection, both of which operate entirely online, learning policies through interactions with the edge environment. To support learning under realistic constraints and evaluate system performance, we implement a distributed testbed comprising lab-hosted end devices and FABRIC infrastructure-hosted edge servers to emulate smart city edge infrastructure under realistic disruption scenarios. Results show that the proposed framework improves application reliability by effectively balancing end-to-end latency and reconstruction quality in dynamic environments. Multi-view 3D reconstruction of unknown scenes is becoming a foundational capability for mission-critical smart city applications, such as emergency response and public safety, providing situational awareness for rapid, informed decision making [1], [2]. This technique aggregates images from multiple viewpoints to generate 3D representations, typically as point clouds or meshes.


Federated Split Learning for Resource-Constrained Robots in Industrial IoT: Framework Comparison, Optimization Strategies, and Future Directions

Ni, Wanli, Tian, Hui, Wang, Shuai, Li, Chengyang, Sun, Lei, Yang, Zhaohui

arXiv.org Artificial Intelligence

Abstract--Federated split learning (FedSL) has emerged as a promising paradigm for enabling collaborative intelligence in industrial Internet of Things (IoT) systems, particularly in smart factories where data privacy, communication efficiency, and device heterogeneity are critical concerns. In this article, we present a comprehensive study of FedSL frameworks tailored for resource-constrained robots in industrial scenarios. We compare synchronous, asynchronous, hierarchical, and heterogeneous FedSL frameworks in terms of workflow, scalability, adaptability, and limitations under dynamic industrial conditions. Furthermore, we systematically categorize token fusion strategies into three paradigms: input-level (pre-fusion), intermediate-level (intra-fusion), and output-level (post-fusion), and summarize their respective strengths in industrial applications. We also provide adaptive optimization techniques to enhance the efficiency and feasibility of FedSL implementation, including model compression, split layer selection, computing frequency allocation, and wireless resource management. Finally, we outline open issues and research directions of FedSL in future smart manufacturing systems. The rapid evolution of the industrial Internet of Things (IoT) has catalyzed a paradigm shift toward intelligent, autonomous, and interconnected manufacturing systems [1]. At the heart of this transformation are networked robots that perform complex tasks such as quality inspection, predictive maintenance, and multi-device collaboration across dynamic production environments. These robots are increasingly equipped with multimodal sensors and onboard computing units, enabling them to perceive, reason, and act in real time [2].